{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "name": "keras.ipynb",
      "provenance": [],
      "private_outputs": true,
      "collapsed_sections": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Tce3stUlHN0L"
      },
      "source": [
        "##### Copyright 2019 The TensorFlow Authors."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "tuOe1ymfHZPu",
        "colab": {}
      },
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "MfBg1C5NB3X0"
      },
      "source": [
        "# Keras 的分布式训练"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "r6P32iYYV27b"
      },
      "source": [
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://tensorflow.google.cn/tutorials/distribute/keras\"><img src=\"https://tensorflow.google.cn/images/tf_logo_32px.png\" />在 tensorflow.google.cn 上查看</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/docs/blob/master/site/zh-cn/tutorials/distribute/keras.ipynb\"><img src=\"https://tensorflow.google.cn/images/colab_logo_32px.png\" />在 Google Colab 运行</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://github.com/tensorflow/docs/blob/master/site/zh-cn/tutorials/distribute/keras.ipynb\"><img src=\"https://tensorflow.google.cn/images/GitHub-Mark-32px.png\" />在 Github 上查看源代码</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a href=\"https://storage.googleapis.com/tensorflow_docs/docs/site/zh-cn/tutorials/distribute/keras.ipynb\"><img src=\"https://tensorflow.google.cn/images/download_logo_32px.png\" />下载此 notebook</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GEe3i16tQPjo",
        "colab_type": "text"
      },
      "source": [
        "Note: 我们的 TensorFlow 社区翻译了这些文档。因为社区翻译是尽力而为， 所以无法保证它们是最准确的，并且反映了最新的\n",
        "[官方英文文档](https://www.tensorflow.org/?hl=en)。如果您有改进此翻译的建议， 请提交 pull request 到\n",
        "[tensorflow/docs](https://github.com/tensorflow/docs) GitHub 仓库。要志愿地撰写或者审核译文，请加入\n",
        "[docs-zh-cn@tensorflow.org Google Group](https://groups.google.com/a/tensorflow.org/forum/#!forum/docs-zh-cn)。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "xHxb-dlhMIzW"
      },
      "source": [
        "## 概述\n",
        "\n",
        "`tf.distribute.Strategy` API 提供了一个抽象的 API ，用于跨多个处理单元（processing units）分布式训练。它的目的是允许用户使用现有模型和训练代码，只需要很少的修改，就可以启用分布式训练。\n",
        "\n",
        "本教程使用 `tf.distribute.MirroredStrategy`，这是在一台计算机上的多 GPU（单机多卡）进行同时训练的图形内复制（in-graph replication）。事实上，它会将所有模型的变量复制到每个处理器上，然后，通过使用 [all-reduce](http://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/) 去整合所有处理器的梯度（gradients），并将整合的结果应用于所有副本之中。\n",
        "\n",
        "`MirroredStategy` 是 tensorflow 中可用的几种分发策略之一。 您可以在 [分发策略指南](../../guide/distribute_strategy.ipynb) 中阅读更多分发策略。\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "MUXex9ctTuDB"
      },
      "source": [
        "### Keras API\n",
        "\n",
        "这个例子使用 `tf.keras` API 去构建和训练模型。 关于自定义训练模型，请参阅 [tf.distribute.Strategy with training loops](training_loops.ipynb) 教程。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Dney9v7BsJij"
      },
      "source": [
        "## 导入依赖"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "r8S3ublR7Ay8",
        "colab": {}
      },
      "source": [
        "from __future__ import absolute_import, division, print_function, unicode_literals\n",
        "\n",
        "# 导入 TensorFlow 和 TensorFlow 数据集\n",
        "try:\n",
        "  # %tensorflow_version 只存在于 Colab。\n",
        "  %tensorflow_version 2.x\n",
        "except Exception:\n",
        "  pass\n",
        "import tensorflow_datasets as tfds\n",
        "import tensorflow as tf\n",
        "tfds.disable_progress_bar()\n",
        "\n",
        "import os"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "hXhefksNKk2I"
      },
      "source": [
        "## 下载数据集"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "OtnnUwvmB3X5"
      },
      "source": [
        "下载 MNIST 数据集并从 [TensorFlow Datasets](https://tensorflow.google.cn/datasets) 加载。 这会返回 `tf.data` 格式的数据集。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "lHAPqG8MtS8M"
      },
      "source": [
        "将 `with_info` 设置为 `True` 会包含整个数据集的元数据,其中这些数据集将保存在 `info` 中。 除此之外，该元数据对象包括训练和测试示例的数量。 \n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "iXMJ3G9NB3X6",
        "colab": {}
      },
      "source": [
        "datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)\n",
        "\n",
        "mnist_train, mnist_test = datasets['train'], datasets['test']"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "GrjVhv-eKuHD"
      },
      "source": [
        "## 定义分配策略"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "TlH8vx6BB3X9"
      },
      "source": [
        "创建一个 `MirroredStrategy` 对象。这将处理分配策略，并提供一个上下文管理器（`tf.distribute.MirroredStrategy.scope`）来构建你的模型。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "4j0tdf4YB3X9",
        "colab": {}
      },
      "source": [
        "strategy = tf.distribute.MirroredStrategy()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "cY3KA_h2iVfN",
        "colab": {}
      },
      "source": [
        "print('Number of devices: {}'.format(strategy.num_replicas_in_sync))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "lNbPv0yAleW8"
      },
      "source": [
        "## 设置输入管道（pipeline）"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "psozqcuptXhK"
      },
      "source": [
        "在训练具有多个 GPU 的模型时，您可以通过增加批量大小（batch size）来有效地使用额外的计算能力。通常来说，使用适合 GPU 内存的最大批量大小（batch size），并相应地调整学习速率。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "p1xWxKcnhar9",
        "colab": {}
      },
      "source": [
        "# 您还可以执行 info.splits.total_num_examples 来获取总数\n",
        "# 数据集中的样例数量。\n",
        "\n",
        "num_train_examples = info.splits['train'].num_examples\n",
        "num_test_examples = info.splits['test'].num_examples\n",
        "\n",
        "BUFFER_SIZE = 10000\n",
        "\n",
        "BATCH_SIZE_PER_REPLICA = 64\n",
        "BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "0Wm5rsL2KoDF"
      },
      "source": [
        "0-255 的像素值， [必须标准化到 0-1 范围](https://en.wikipedia.org/wiki/Feature_scaling)。在函数中定义标准化。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "Eo9a46ZeJCkm",
        "colab": {}
      },
      "source": [
        "def scale(image, label):\n",
        "  image = tf.cast(image, tf.float32)\n",
        "  image /= 255\n",
        "\n",
        "  return image, label"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "WZCa5RLc5A91"
      },
      "source": [
        "将此功能应用于训练和测试数据，随机打乱训练数据，并[批量训练](https://tensorflow.google.cn/api_docs/python/tf/data/Dataset#batch)。 请注意，我们还保留了训练数据的内存缓存以提高性能。\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "gRZu2maChwdT",
        "colab": {}
      },
      "source": [
        "train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)\n",
        "eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "4xsComp8Kz5H"
      },
      "source": [
        "## 生成模型"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "1BnQYQTpB3YA"
      },
      "source": [
        "在 `strategy.scope` 的上下文中创建和编译 Keras 模型。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "IexhL_vIB3YA",
        "colab": {}
      },
      "source": [
        "with strategy.scope():\n",
        "  model = tf.keras.Sequential([\n",
        "      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),\n",
        "      tf.keras.layers.MaxPooling2D(),\n",
        "      tf.keras.layers.Flatten(),\n",
        "      tf.keras.layers.Dense(64, activation='relu'),\n",
        "      tf.keras.layers.Dense(10, activation='softmax')\n",
        "  ])\n",
        "\n",
        "  model.compile(loss='sparse_categorical_crossentropy',\n",
        "                optimizer=tf.keras.optimizers.Adam(),\n",
        "                metrics=['accuracy'])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "8i6OU5W9Vy2u"
      },
      "source": [
        "## 定义回调（callback）"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "YOXO5nvvK3US"
      },
      "source": [
        "这里使用的回调（callbacks）是：\n",
        "\n",
        "*   *TensorBoard*: 此回调（callbacks）为 TensorBoard 写入日志，允许您可视化图形。\n",
        "*   *Model Checkpoint*: 此回调（callbacks）在每个 epoch 后保存模型。\n",
        "*   *Learning Rate Scheduler*: 使用此回调（callbacks），您可以安排学习率在每个 epoch/batch 之后更改。\n",
        "\n",
        "为了便于说明，添加打印回调（callbacks）以在笔记本中显示*学习率*。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "A9bwLCcXzSgy",
        "colab": {}
      },
      "source": [
        "# 定义检查点（checkpoint）目录以存储检查点（checkpoints）\n",
        "\n",
        "checkpoint_dir = './training_checkpoints'\n",
        "# 检查点（checkpoint）文件的名称\n",
        "checkpoint_prefix = os.path.join(checkpoint_dir, \"ckpt_{epoch}\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "wpU-BEdzJDbK",
        "colab": {}
      },
      "source": [
        "# 衰减学习率的函数。\n",
        "# 您可以定义所需的任何衰减函数。\n",
        "def decay(epoch):\n",
        "  if epoch < 3:\n",
        "    return 1e-3\n",
        "  elif epoch >= 3 and epoch < 7:\n",
        "    return 1e-4\n",
        "  else:\n",
        "    return 1e-5"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "jKhiMgXtKq2w",
        "colab": {}
      },
      "source": [
        "# 在每个 epoch 结束时打印LR的回调（callbacks）。\n",
        "class PrintLR(tf.keras.callbacks.Callback):\n",
        "  def on_epoch_end(self, epoch, logs=None):\n",
        "    print('\\nLearning rate for epoch {} is {}'.format(epoch + 1,\n",
        "                                                      model.optimizer.lr.numpy()))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "YVqAbR6YyNQh",
        "colab": {}
      },
      "source": [
        "callbacks = [\n",
        "    tf.keras.callbacks.TensorBoard(log_dir='./logs'),\n",
        "    tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,\n",
        "                                       save_weights_only=True),\n",
        "    tf.keras.callbacks.LearningRateScheduler(decay),\n",
        "    PrintLR()\n",
        "]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "70HXgDQmK46q"
      },
      "source": [
        "## 训练和评估"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "6EophnOAB3YD"
      },
      "source": [
        "在该部分，以普通的方式训练模型，在模型上调用 `fit` 并传入在教程开始时创建的数据集。 无论您是否分布式训练，此步骤都是相同的。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "7MVw_6CqB3YD",
        "colab": {}
      },
      "source": [
        "model.fit(train_dataset, epochs=12, callbacks=callbacks)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "NUcWAUUupIvG"
      },
      "source": [
        "如下所示，检查点（checkpoint）将被保存。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "JQ4zeSTxKEhB",
        "colab": {}
      },
      "source": [
        "# 检查检查点（checkpoint）目录\n",
        "!ls {checkpoint_dir}"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "qor53h7FpMke"
      },
      "source": [
        "要查看模型的执行方式，请加载最新的检查点（checkpoint）并在测试数据上调用 `evaluate` 。\n",
        "\n",
        "使用适当的数据集调用 `evaluate` 。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "JtEwxiTgpQoP",
        "colab": {}
      },
      "source": [
        "model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))\n",
        "\n",
        "eval_loss, eval_acc = model.evaluate(eval_dataset)\n",
        "\n",
        "print('Eval loss: {}, Eval Accuracy: {}'.format(eval_loss, eval_acc))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "IIeF2RWfYu4N"
      },
      "source": [
        "要查看输出，您可以在终端下载并查看 TensorBoard 日志。\n",
        "\n",
        "```\n",
        "$ tensorboard --logdir=path/to/log-directory\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "LnyscOkvKKBR",
        "colab": {}
      },
      "source": [
        "!ls -sh ./logs"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "kBLlogrDvMgg"
      },
      "source": [
        "## 导出到 SavedModel"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Xa87y_A0vRma"
      },
      "source": [
        "将图形和变量导出为与平台无关的 SavedModel 格式。 保存模型后，可以在有或没有 scope 的情况下加载模型。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "h8Q4MKOLwG7K",
        "colab": {}
      },
      "source": [
        "path = 'saved_model/'"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "4HvcDmVsvQoa",
        "colab": {}
      },
      "source": [
        "tf.keras.experimental.export_saved_model(model, path)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "vKJT4w5JwVPI"
      },
      "source": [
        "在无需 `strategy.scope` 加载模型。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "T_gT0RbRvQ3o",
        "colab": {}
      },
      "source": [
        "unreplicated_model = tf.keras.experimental.load_from_saved_model(path)\n",
        "\n",
        "unreplicated_model.compile(\n",
        "    loss='sparse_categorical_crossentropy',\n",
        "    optimizer=tf.keras.optimizers.Adam(),\n",
        "    metrics=['accuracy'])\n",
        "\n",
        "eval_loss, eval_acc = unreplicated_model.evaluate(eval_dataset)\n",
        "\n",
        "print('Eval loss: {}, Eval Accuracy: {}'.format(eval_loss, eval_acc))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "YBLzcRF0wbDe"
      },
      "source": [
        "在含 `strategy.scope` 加载模型。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "BBVo3WGGwd9a",
        "colab": {}
      },
      "source": [
        "with strategy.scope():\n",
        "  replicated_model = tf.keras.experimental.load_from_saved_model(path)\n",
        "  replicated_model.compile(loss='sparse_categorical_crossentropy',\n",
        "                           optimizer=tf.keras.optimizers.Adam(),\n",
        "                           metrics=['accuracy'])\n",
        "\n",
        "  eval_loss, eval_acc = replicated_model.evaluate(eval_dataset)\n",
        "  print ('Eval loss: {}, Eval Accuracy: {}'.format(eval_loss, eval_acc))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "MUZwaz4AKjtD"
      },
      "source": [
        "### 示例和教程\n",
        "以下是使用 keras fit/compile 分布式策略的一些示例：\n",
        "1. 使用`tf.distribute.MirroredStrategy` 训练 [Transformer](https://github.com/tensorflow/models/blob/master/official/transformer/v2/transformer_main.py) 的示例。\n",
        "2. 使用`tf.distribute.MirroredStrategy` 训练 [NCF](https://github.com/tensorflow/models/blob/master/official/recommendation/ncf_keras_main.py) 的示例。\n",
        "\n",
        "[分布式策略指南](../../guide/distribute_strategy.ipynb#examples_and_tutorials)中列出的更多示例 "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "8uNqWRdDMl5S"
      },
      "source": [
        "## 下一步\n",
        "\n",
        "* 阅读[分布式策略指南](../../guide/distribute_strategy.ipynb)。\n",
        "* 阅读[自定义训练的分布式训练](training_loops.ipynb)教程。\n",
        "\n",
        "注意：`tf.distribute.Strategy` 正在积极开发中，我们将在不久的将来添加更多示例和教程。欢迎您进行尝试。我们欢迎您通过[ GitHub 上的 issue ](https://github.com/tensorflow/tensorflow/issues/new) 提供反馈。"
      ]
    }
  ]
}